Homes built after 1980 are more likely to have larger living areas, multiple stories, and more bathrooms. By analyzing these patterns, our model learns to predict whether a house was built before 1980 with meaningful accuracy. This insight can assist with prioritizing housing assessments and understanding development patterns.
QUESTION|TASK 1
These visualizations show potential relationships that a machine learning model could use to split the data. For instance, the living area (Chart 1) suggests that post-1980 homes are generally larger. Bathroom count (Chart 2) shows a shift in distribution where newer homes more often include additional bathrooms. Lastly, stories (Chart 3) indicates that single-story homes may be more common in earlier decades. These patterns can serve as helpful split points in decision trees or contribute predictive value in models like random forests or logistic regression.
Show the code
# Chart A: Grouped bathroomsplt.figure(figsize=(8, 5))sns.countplot(data=df, x='numbaths_grouped', hue='before1980_label', palette=custom_palette)plt.title('Grouped Number of Bathrooms vs. Year Built')plt.xlabel('Number of Bathrooms')plt.ylabel('Count')plt.tight_layout()plt.show()
Show the code
# Chart B: Living area boxplotplt.figure(figsize=(8, 5))sns.boxplot(data=df, x='before1980_label', y='livearea', palette=custom_palette)plt.title('Living Area by Year Built')plt.xlabel('Year Built Category')plt.ylabel('Living Area (sqft)')plt.tight_layout()plt.show()
Show the code
# Chart C: Proportional storiesstory_prop = df.groupby(['stories_str', 'before1980_label']).size().reset_index(name='count')story_total = story_prop.groupby('stories_str')['count'].transform('sum')story_prop['proportion'] = story_prop['count'] / story_totalplt.figure(figsize=(8, 5))sns.barplot(data=story_prop, x='stories_str', y='proportion', hue='before1980_label', palette=custom_palette)plt.title('Proportion of Story Count by Year Built')plt.xlabel('Number of Stories')plt.ylabel('Proportion')plt.tight_layout()plt.show()
QUESTION|TASK 2
A Decision Tree Classifier was initially selected to label homes as built before or after 1980, using the features living area (livearea), number of stories (stories), and number of bathrooms (numbaths). The model was tuned with max_depth=5 and min_samples_leaf=50 to reduce overfitting while retaining interpretability for public health staff. The resulting test accuracy was approximately 78.5%, which is a solid baseline but does not meet the 90% target.
Additional models were explored:
Logistic Regression: around 78.7% accuracy; limited by its linear nature.
k-Nearest Neighbors (k-NN): approximately 88% accuracy, but highly sensitive to data scaling and local outliers.
Random Forest: approximately 80.3% accuracy, with stronger generalization than a single tree, and offering ranked feature importances for interpretability.
None of the models tested reached the 90% accuracy goal. However, Random Forest provides the best balance of interpretability, robustness, and predictive performance for a classification baseline. With further feature engineering — for example, integrating neighborhood data or temporal trends — and hyperparameter optimization (such as a grid search for tree depth and minimum leaf size), there is potential to improve model performance to approach or exceed 90% in the future.
In the current scope, the Random Forest is recommended as the best candidate for a production baseline. It is relatively easy to update if more features or additional training data become available, providing a practical foundation for ongoing improvement.
QUESTION|TASK 3
Feature importance analysis revealed that living area (livearea) was the strongest predictor of whether a home was built before or after 1980. This makes sense because homes constructed after 1980 tend to follow modern architectural trends favoring more spacious floorplans, in contrast to smaller post-war homes built before stricter asbestos regulations.
The second most important feature was number of bathrooms (numbaths). Newer homes typically include more bathrooms to match contemporary expectations for convenience and functionality, making bathroom count a reliable indicator of more recent construction.
The number of stories (stories) feature also contributed to the classification model, although with a lower importance score. This is still valuable because single-story homes were historically more common in earlier decades, whereas modern subdivisions often include two-story designs.
Show the code
# Decision Tree importanceimportance_tree = pd.Series(clf_tree.feature_importances_, index=features.columns).sort_values()plt.figure(figsize=(8, 5))importance_tree.plot(kind='barh', color='lightseagreen', edgecolor='black')plt.title('Decision Tree Feature Importance')plt.xlabel('Importance Score')plt.tight_layout()plt.show()
These patterns are visualized in the accompanying feature importance chart below, which shows the ranked contribution of each feature to the model. The chart confirms that living area, number of bathrooms, and number of stories are the dominant variables. Overall, these variables align with real-world domain knowledge about housing design and construction patterns and justify the model’s predictions in a way that is explainable and transparent for stakeholders.
Further improvements could include adding neighborhood-level attributes or temporal price trends to enhance predictive power.
QUESTION|TASK 4
I evaluated the classification models using three common metrics: accuracy, precision, and recall. Each provides a different perspective on model performance:
Accuracy measures the overall proportion of correct predictions across both classes. The Random Forest achieved approximately 80% accuracy, slightly higher than the Decision Tree (78%) and Logistic Regression (79%). However, accuracy alone can be misleading if the classes are imbalanced, which is why we also examine precision and recall.
Show the code
print("Decision Tree Evaluation:")print(classification_report(y_test, y_pred_tree))print(confusion_matrix(y_test, y_pred_tree))
Precision measures how many predicted positives were actually correct. For example, Random Forest had a precision of 0.84 for the positive (before1980) class, meaning when it predicts a home is pre-1980, it is correct 84% of the time. Precision is especially important if a false positive (wrongly classifying a newer home as old) has public health or safety consequences, such as unnecessary asbestos remediation.
Recall measures how many true positives were captured among all actual positives. The Random Forest achieved a recall of 0.85 for the positive class, meaning it correctly identified 85% of homes that were truly built before 1980. High recall is crucial if you want to avoid missing any potentially hazardous homes.
As a balanced measure, the f1-score combines precision and recall, showing the Random Forest at 0.84 for the positive class, which is a solid compromise between missing too many cases and misclassifying safe homes.
Interpretation of confusion matrices shows most of the model errors were between these borderline homes built near 1980, which is expected. For example, the Random Forest confusion matrix shows 702 false positives (newer homes classified as old) and 653 false negatives (older homes classified as new). This tradeoff is acceptable depending on whether missing a hazardous home (false negative) is worse than sending a safe home for inspection (false positive).
Overall, while no model reached the 90% accuracy target, the Random Forest achieved the best balance of precision, recall, and interpretability, making it the most practical choice for a production environment. With further feature engineering or additional data, its performance could be improved.
STRETCH QUESTION|TASK 1
For this stretch question, I tested three different algorithms to classify whether a home was built before 1980: Random Forest, Logistic Regression, and XGBoost. Each model was evaluated with a confusion matrix and either feature importance or coefficient values.
At first, all three models showed perfect or near-perfect accuracy, which seemed too good to be true. After checking, I realized the problem was that I had included yrbuilt as a feature, which is basically the answer to whether a house was built before 1980. Including it let the models “cheat” by memorizing the target, which is why the accuracy was 100%.
Random Forest’s feature importances and XGBoost’s results both confirmed this, because yrbuilt was by far the most dominant variable. Logistic Regression showed the same thing, with a huge coefficient on yrbuilt.
If yrbuilt is removed (which it should be, since you wouldn’t know it when predicting), Random Forest is still the strongest option. In earlier testing without yrbuilt, it achieved around 80% accuracy with a good balance of precision and recall. That makes it the most reliable recommendation for the client right now, with potential for improvement if more features or neighborhood data are added in the future.
After merging the neighborhood data with the dwellings data, I reran the same three algorithms: Random Forest, Logistic Regression, and XGBoost. All three models showed perfect or near-perfect accuracy again, with 100% classification rates, which is a strong indicator of data leakage. This happened because the yrbuilt column was still included as a feature after joining, which lets the models essentially memorize the answer.
The feature importances and coefficients confirmed this: yrbuilt was still the dominant driver in every model, overwhelming the effect of the new neighborhood variables. The added neighborhood features (like the nbhd_ variables) contributed almost nothing, showing extremely low or even zero importance scores.
Because of this, the results with the merged dataset do not actually change the recommended model. If we remove yrbuilt from the features, Random Forest would still likely perform best, similar to its ~80% accuracy from earlier runs. The neighborhood features might provide a small boost if the model is properly cleaned of leakage, but on their own they did not shift the model’s decision boundaries in a meaningful way.
In short, joining the neighborhood data did not meaningfully improve the models when yrbuilt was present, but could be helpful in the future if handled carefully and after removing data leakage.
For this stretch question, I built a regression model to predict the year a house was built using a Random Forest Regressor. The model achieved a root mean squared error (RMSE) of about 11.6 years, meaning on average predictions were within roughly 12 years of the true build year. The median absolute error was lower, at around 3.9 years, which shows that half of the predictions were off by less than four years — a good sign that the model handles most houses reasonably well, with a few larger outliers.
The R² value was approximately 0.901, which means the model explained about 90% of the variance in the year built. Overall, this is a strong score for a regression problem with a complex target like construction year.
While the random forest did a good job predicting year built, there is room to improve by adding more external features, such as neighborhood development data or historical zoning codes, to tighten the RMSE even further. Still, this model would provide a solid starting point for helping the client estimate missing build years when needed.
Show the code
from sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import mean_squared_error, r2_score, median_absolute_errorimport matplotlib.pyplot as pltimport pandas as pdfrom sklearn.model_selection import train_test_splitimport numpy as np# Load your confirmed datadf = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv")# set targety_reg = df["yrbuilt"]# drop target and identifierX_reg = df.drop(columns=["yrbuilt", "parcel"])X_reg = X_reg.fillna(0)# splitX_train, X_test, y_train, y_test = train_test_split( X_reg, y_reg, test_size=0.2, random_state=42)# Random Forest Regressorrf_reg = RandomForestRegressor(n_estimators=100, random_state=42)rf_reg.fit(X_train, y_train)# predicty_pred = rf_reg.predict(X_test)# evaluationrmse = np.sqrt(mean_squared_error(y_test, y_pred)) # fixes the errorr2 = r2_score(y_test, y_pred)medae = median_absolute_error(y_test, y_pred)print(f"RMSE: {rmse:.2f}")print(f"R²: {r2:.3f}")print(f"Median Absolute Error: {medae:.2f}")# residual plotresiduals = y_test - y_predplt.scatter(y_pred, residuals)plt.axhline(0, color="red")plt.xlabel("Predicted Year Built")plt.ylabel("Residuals")plt.title("Residual Plot")plt.show()# feature importancespd.Series(rf_reg.feature_importances_, index=X_reg.columns).sort_values().plot( kind="barh", figsize=(10,12))plt.title("Feature Importances for Year Built Regression")plt.tight_layout()plt.show()